Assignment 01.2

Author: Igor Matheus S. Moreira

Data set: Brazilian houses to rent

This work aims to answer three questions, which are listed under "Exploring the data" in the summary below.

Summary

  1. Requirements
  2. Loading the data
  3. Preprocessing the data
  4. Exploring the data
    1. How representative is this data set country-wise?
    2. Does more expensive real estate in terms of rent tend to be buildings?
    3. Are there factors that influence the permission or prohibition of animals?

Requirements

This notebook was produced using Python $3.7.6$ (with conda) and has the following requirements:


Loading the data

The data set used herein is brazilian_houses_to_rent. In particular, the houses_to_rent_v2.csv file is used.
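As a minimal sketch of this step, assuming pandas is used and the CSV file sits in the working directory. The load_houses helper and the two sample rows are hypothetical and only serve to make the sketch runnable without the real file:

```python
import io

import pandas as pd


def load_houses(source="houses_to_rent_v2.csv"):
    """Read the listings into a DataFrame; `source` may be a path or a file-like object."""
    return pd.read_csv(source)


# Tiny in-memory stand-in for the real file (made-up rows, not actual data).
sample_csv = io.StringIO(
    "city,floor,animal,rent amount (R$)\n"
    "São Paulo,7,acept,3300\n"
    "Porto Alegre,-,not acept,1200\n"
)
houses_sample = load_houses(sample_csv)
print(houses_sample.dtypes)
```

Because some floor entries are -, pandas cannot parse the column as numeric on load, which is what motivates the conversion in the next section.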


Preprocessing the data

Before we begin, the floor column should be converted to a numeric type and the typo in the animal column corrected. When converting the dtype of floor, we will assume that all - entries (which amount to $23\%$ of the entries) are houses and that the remaining entries (where floor is specified) are buildings. This assumption is also used to create the type of real estate column.
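The conversion described above could look roughly like the sketch below. It assumes pandas, assumes the typo is the misspelled value "acept", and uses "house" and "building" as illustrative labels; the rows are made up:

```python
import pandas as pd

# Made-up rows mimicking the raw columns; "acept" is assumed to be the typo in question.
houses = pd.DataFrame({
    "floor": ["7", "-", "12", "-"],
    "animal": ["acept", "not acept", "acept", "acept"],
})

# A missing floor ("-") is taken to mean that the listing is a house.
is_house = houses["floor"] == "-"
houses["type of real estate"] = is_house.map({True: "house", False: "building"})

# Houses get floor 0; every other entry is parsed as an integer.
houses["floor"] = houses["floor"].mask(is_house, "0").astype(int)

# Fix the misspelling in the pet-policy values.
houses["animal"] = houses["animal"].str.replace("acept", "accept", regex=False)

print(houses)
```

The type of real estate column is derived before floor is overwritten, since the - marker is lost once the column becomes numeric.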

Next, houses must receive the latitude and longitude of each listing's city.
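One way to attach coordinates is a left join against a small lookup table. This is a sketch assuming pandas; the city_coords table, its column names, and the approximate coordinates are illustrative and not taken from the original notebook:

```python
import pandas as pd

# Approximate city-centre coordinates, for illustration only.
city_coords = pd.DataFrame({
    "city": ["São Paulo", "Rio de Janeiro", "Porto Alegre"],
    "latitude": [-23.55, -22.91, -30.03],
    "longitude": [-46.63, -43.17, -51.22],
})

# Made-up listings; only the city column matters here.
houses = pd.DataFrame({"city": ["São Paulo", "Porto Alegre", "São Paulo"]})

# A left join keeps every listing, even if a city were missing from the lookup table.
houses = houses.merge(city_coords, on="city", how="left")

print(houses)
```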

Finally, to ensure the best visualization results, a version of houses with no outliers in hoa (R$), rent amount (R$), property tax (R$), fire insurance (R$), and total (R$) will be produced and termed houses_no_outliers.
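The text does not state which outlier rule is used. A common choice, assumed here, is Tukey's fences with a factor of 1.5 on the interquartile range; drop_outliers is a hypothetical helper and the rows are made up:

```python
import pandas as pd


def drop_outliers(df, columns, k=1.5):
    """Keep rows lying within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Made-up rents with one clearly extreme value.
houses = pd.DataFrame({"rent amount (R$)": [1000, 1200, 1500, 1800, 2000, 45000]})
houses_no_outliers = drop_outliers(houses, ["rent amount (R$)"])

print(houses_no_outliers)
```

In the notebook's setting, the columns argument would hold the five monetary columns named above, so a row is dropped if it is an outlier in any of them.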


Exploring the data

Once the data is loaded and preprocessed, it is time to explore it and answer the questions asked at the beginning of this notebook. Before we begin, let us load a topoJSON file to draw Brazil on a map.

How representative is this data set country-wise?

To answer this question, it would be interesting to see a geoplot with circle marks whose size changes with the number of listings in each city. In addition, an accompanying bar plot would show more clearly how the listings are distributed across the cities.

It is noticeable from the map that São Paulo has more listings than the other cities; in fact, São Paulo alone has 5887 listings, whereas all other cities combined amount to 4805. This imbalance is more clearly depicted in the bar plot above. Given how unbalanced this data set is and how few cities it describes, it is safe to assume that it is not representative of the real estate situation in the whole country. Both visualizations are linked and interactive, allowing the user to quickly see where each city is located in Brazil.
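To put the imbalance in numbers, the share of São Paulo can be worked out directly from the two counts quoted above (the variable names are illustrative):

```python
sao_paulo = 5887  # listings in São Paulo (from the text)
others = 4805     # listings everywhere else combined (from the text)

total = sao_paulo + others
share = sao_paulo / total

print(f"{total} listings, {share:.1%} of them in São Paulo")
```

That is, a single city accounts for slightly more than half of the 10692 listings.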


Does more expensive real estate in terms of rent tend to be buildings?

As said in Preprocessing the data, the entries with - (which were transformed to 0) were assumed to represent houses, whereas the remaining ones represent buildings. This assumption is not explicitly backed up by the data and their description, but it is regarded as pertinent since (most) houses are not customarily rented/sold by floor.

Some aspects can be observed from the visualizations above. Firstly, employing houses_no_outliers instead of houses improves the visualization by removing extreme and mild outliers. Hence, the space is better used and the user can better perceive how the listings are distributed in terms of rent costs. Overall, we see that the majority of the listings are buildings, particularly in the bins with lower rent amounts. It can also be noticed that there is not a balanced sample of listings for each price range.

This makes it difficult to draw conclusions, as most of the expensive price ranges contain fewer than $200$ instances per bin. By normalizing the histogram bins, we can see that the proportion of house listings tends to grow as the price range increases. It would be nice to have additional details on the surroundings of these listings to see if their location plays a significant role in the rent amount; however, since the data set only provides the city of the listing, this analysis cannot be made. To succinctly answer the question, houses seem to be more prevalent among the more expensive listings; overall, however, listings of buildings (apartments) are far more frequent.
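The effect of normalizing the bins can be reproduced in tabular form. The sketch below assumes pandas, uses made-up rows and arbitrary bin edges, and compares raw counts per rent bin with within-bin proportions:

```python
import pandas as pd

# Made-up listings; the bin edges below are arbitrary.
houses = pd.DataFrame({
    "rent amount (R$)": [900, 1500, 1800, 2500, 5200, 5600, 5900, 5800],
    "type of real estate": ["building", "building", "house", "building",
                            "house", "house", "building", "house"],
})

# Assign each listing to a rent bin.
houses["rent bin"] = pd.cut(houses["rent amount (R$)"], bins=[0, 3000, 6000])

# Raw counts per bin (what a stacked histogram shows).
counts = pd.crosstab(houses["rent bin"], houses["type of real estate"])

# Proportions within each bin (what a normalized stacked histogram shows).
shares = pd.crosstab(houses["rent bin"], houses["type of real estate"],
                     normalize="index")

print(counts)
print(shares)
```

Normalizing by row removes the effect of bin size, which is why a trend in proportions can be visible even when the expensive bins hold few listings. It also means that proportions in sparsely populated bins should be read with caution.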


Are there factors that influence the permission or prohibition of animals?

One way to answer this question is to use visualizations similar to those seen in the previous question. This time, only plots made with houses_no_outliers will be displayed, as they allow the user to better see how the data is distributed.

These plots enable us to see if the rate of acceptance/rejection of animals changes as the rent amount increases, thus allowing us to ascertain if the rent cost has an influence on pet acceptance. This time around, the proportions of listings that accept pets or not in the normalized stacked bar plot are more constant. In the visualizations used to answer the previous question, the share of house listings visibly grew as the rent prices went up. Here, however, no such tendency is apparent. Yet again, the fact that several of the most expensive price ranges are represented by fewer than $200$ instances means that the actual distributions could vary significantly. The conclusion here is that, regardless of rent amount, the majority of listings accept animals.

This analysis could be complemented by also considering other factors: does the city significantly influence pet acceptance? What about whether the real estate is a house or a building?

These visualizations allow us to see how the listings are distributed by city in terms of accepting pets or not, with respect to the rent amount. In addition, these plots can be filtered by real estate type. In other words, four attributes are considered in these plots: rent amount, city, type of real estate, and acceptance of pets.

They show a similar trend to what was observed in the histograms: regardless of rent amount, real estate type, and city, the majority of listings have no objection to animals. To answer the proposed question, it is relatively safe to assume that no feature in this data set is strongly correlated with the acceptance or prohibition of pets. As such, more information would be needed to further investigate this matter.
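A numeric counterpart to these plots is the acceptance rate per group. The sketch below assumes pandas, the corrected "accept" / "not accept" values, and made-up rows; the same pattern extends to rent bins:

```python
import pandas as pd

# Made-up listings, for illustration only.
houses = pd.DataFrame({
    "city": ["São Paulo", "São Paulo", "São Paulo",
             "Campinas", "Campinas", "Campinas"],
    "type of real estate": ["building", "building", "house",
                            "building", "house", "house"],
    "animal": ["accept", "not accept", "accept",
               "accept", "accept", "not accept"],
})

# True where the listing accepts pets; the mean of a boolean is the acceptance rate.
accepts = houses["animal"].eq("accept")
rate_by_group = accepts.groupby(
    [houses["city"], houses["type of real estate"]]
).mean()

print(rate_by_group)
```

If the rates are similar across all groups, as the plots suggest for the real data, none of these attributes separates pet-friendly listings from the rest.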
